WO 98/13521 PCT/EP97/05290
METHOD FOR THE DIFFERENTIAL SCREENING OF GENE EXPRESSION BY
RANDOM PRIMED REVERSE TRANSCRIPTION-POLYMERASE CHAIN
REACTION

The present invention concerns a method for the 
differential screening of gene expression by random primed 
Reverse Transcription-Polymerase Chain Reaction (RT-PCR) and
a kit to be used for the performance of said method. 

The analysis of differential gene transcription 
focusses on molecular mechanisms involved in major bio- 
logical processes, such as cell differentiation, cell 
division, embryonic development and neoplastic transfor- 
mation. A multitude of techniques has become available in 
recent times to isolate differentially expressed genes. 
These techniques can be grouped in two classes: subtractive 
hybridization and differential screening. Hybridization- 
based differential screening and subtractive techniques are 
extensively covered elsewhere (1).

In 1992, Liang and Pardee first described a new, RT- 
PCR-based differential screening technique which they named 
Differential Display (DD) (2). In this technique, cDNAs are 
synthesized by means of anchored oligo-dT primers to select 
subsets within given mRNA populations. First strand cDNAs 
are subsequently PCR-amplified using the same downstream
oligo-dT primer and an upstream random decamer. The complex
PCR product is separated through a polyacrylamide gel and
detected by autoradiography thanks to the incorporation in
the PCR reaction of a radioactive dNTP. The technique aims
at pinpointing bands corresponding to differentially
expressed genes from a background of ubiquitous,
constitutively expressed products. Various refinements of
the above protocol have been published (3-5). An inherent
limitation of the DD technique is the fact that products 
obtained from DD gels derive almost exclusively from 
noncoding regions of genes, making sequence analysis hardly 
informative. Furthermore, aside from the EST database, which
mostly contains human sequences, only limited information is
deposited into nucleotide sequence databases regarding 3'
UTR regions. Consequently, a lengthy cDNA walk originating
from the 3' end of a transcript can, frustratingly, result
in the cloning of known coding sequence. Finally, problems
can arise from the low sensitivity of the DD technique,
which mostly identifies medium to high abundance
transcripts, probably due to the use of low-complexity
(oligo-dT) primers for PCR amplification.

Around the same period of time, a different RNA 
fingerprinting protocol was developed by other authors (6), 
to permit internally primed PCR amplification of oligo-dT- 
primed or random-primed cDNAs. In this protocol, named RAP- 
PCR, only arbitrary primers are used for the radioactive PCR 
amplification step. Preliminary data (GGC, unpublished 
observations) clearly indicate that this procedure leads to 
greater sensitivity, improved amplification and cloning 
efficiencies, and that a significant share of cDNAs lie in 
coding regions, making their sequence analysis considerably 
more informative and allowing some degree of prediction to 
be made as to their nature and possible function. However, 
unlike DD, arbitrarily primed RNA fingerprinting has not
been systematized to maximize coverage of genes expressed in
a given tissue or cell line within a discrete number of PCR
amplifications. In other words, the main problem with
RAP-PCR lies in the unavailability of a rationally designed
panel of primers permitting an exhaustive, nonredundant 
survey of gene expression in a given biological system. 

The present invention solves the above problem by means 
of a computer-assisted search for RNA fingerprinting primers 
characterized by high amplification efficiencies and a 
marked, nonrandom affinity for coding regions. The
collection of reagents generated according to the invention
allows the use of internally primed, PCR-based RNA
fingerprinting as a reasonably simple, exhaustive and
systematic tool for the analysis of differential gene
expression, and as a workable, advantageous alternative to
differential display.

The method of the invention is characterized in that 
the PCR is carried out using a plurality of oligonucleotide 
primers the sequence of which has been determined by a 
method comprising the following steps: 

a) generation of random primer sequences having a CG/AT
ratio of 2:1, no stop codon, no more than three
consecutive identical nucleotides and no palindromic 5'
and 3' ends;

b) screening of the primer sequences generated in a) by
simulating PCR reactions on non-redundant mammalian
nucleotide sequence databank entries containing at
least 1,000 bp of coding region and calculating for
each primer sequence its:

(i) efficiency index, said efficiency index being
defined as the ratio of the number of PCR products
comprising coding sequences obtained using said
primer sequence to the modal number of PCR
products comprising coding sequences obtained for
each of the whole set of tested primers generated
in a); and

(ii) selectivity index, said selectivity index being 
defined as the ratio between the probabilities of 
yielding a PCR product comprising coding sequences 
or 3' untranslated regions; and 

c) selecting some or all of the primer sequences screened 
in b) according to their efficiency index and 
selectivity index for use in PCR. 
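
The two indices of step b) and the selection of step c) can be sketched as follows. This is only an illustration under stated assumptions: the function names and the input dictionaries are hypothetical, the selectivity index is read here as the plain ratio of CDS-containing to 3' UTR-containing simulated products, and ties between modal values are resolved by taking the smallest mode. The default thresholds are the preferred ranges given further on in the description.

```python
from statistics import multimode


def efficiency_indices(cds_counts):
    """Efficiency index: CDS products of a primer / modal CDS-product count of all primers."""
    # Assumption: if several counts are equally frequent, use the smallest one as the mode.
    mode = min(multimode(cds_counts.values()))
    if mode == 0:
        raise ValueError("modal number of CDS products is zero; index undefined")
    return {primer: n / mode for primer, n in cds_counts.items()}


def selectivity_index(cds_products, utr_products):
    """Assumed reading: ratio of CDS-containing to 3' UTR-containing products."""
    if utr_products == 0:
        return float("inf")
    return cds_products / utr_products


def select_primers(cds_counts, utr_counts, ei_range=(2.0, 10.0), si_min=1.0):
    """Keep primers whose efficiency index lies in ei_range and whose selectivity exceeds si_min."""
    ei = efficiency_indices(cds_counts)
    chosen = []
    for primer, e in ei.items():
        si = selectivity_index(cds_counts[primer], utr_counts[primer])
        if ei_range[0] < e < ei_range[1] and si > si_min:
            chosen.append(primer)
    return chosen
```

With hypothetical counts such as `{"p1": 10, "p2": 10, "p3": 50}` the modal number is 10, so primer `p3` has an efficiency index of 5.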

The invention also provides a kit for differential 
screening of gene expression in biological samples by means 
of random primed RT-PCR, comprising:

a) a plurality of oligonucleotide primers selected 
according to the above described method; 

b) reagents for the reverse transcription and 
amplification reactions; 

c) optionally, protocols for the cloning of the products 
of differential screening. 

The primers selected according to the criteria of the
claimed method allow the detection of more than 80% of cDNAs
containing significant portions of coding regions, compared
with only about 10% of cloned products containing translated
regions obtainable according to the prior art methods.

The present invention therefore provides a useful tool
allowing easier recognition of new sequences as well as an
easier comparison between known genes and the new genes
cloned by the method of the invention.

Brief Description of the Figures

Figure 1. Histogram of the number of simulated PCR products
(in CDS) per primer, tested on human nonredundant pseudo-
cDNA database. Also shown is the expected distribution of
number of products per primer, based on the probability of
matching after randomly scrambling the sequences in the
database (dashed line).

Figure 2. Scatter plot of the number of simulated CDS PCR
products yielded by each primer when tested on the human
(abscissa) or mouse (ordinate) pseudo-cDNA databases (454
primers).

Figure 3. Exhaustivity and redundance analysis on simulated
PCR (96 most efficient primers tested on human nonredundant
pseudo-cDNA database). Panel A shows the distribution of the
number of simulated PCR products per transcript. Solid line: 
expected distribution, based on the probability of matching 
to the randomly scrambled sequences in the database. Dashed 
line: expected distribution for an increase in theoretical 
probability of matching by a factor equal to the ratio 
observed/expected mean number of products per transcript. 
Panel B shows the distribution of the number of different 
primers yielding simulated PCR products from each 
transcript. Dashed line: expected distribution after 
correcting the theoretical matching probability as in panel 
A.

Figure 4. Simulated PCR using pairs of dodecanucleotide
primers degenerate at the 3' end nucleotide, on a
nonredundant pseudo-cDNA database consisting of 2560
transcripts (5.4 million basepairs). Distribution of the
number of different primer pairs yielding simulated PCR
products from each transcript. Solid and dashed lines are
the expected distributions before and after correcting the
theoretical probability of matching by a factor equal to the
ratio observed/expected mean number of products per
transcript.

Figure 5. Correlation between the number of simulated PCR
products and the number of bands in experimental gels.
Simulations performed on mouse nonredundant pseudo-cDNA
database, using 12-nt primers (8 C-G, 4 A-T). Thirteen
primers were arbitrarily chosen among those yielding low,
medium and high numbers of simulated PCR products; PCR
experiments were performed as described in the Methods and
the numbers of clearly discernible bands were recorded. The
line illustrates the least-square linear regression on the
data through the origin (vertical bars = S.D., r = 0.87).

Figure 6. Sample gels obtained through computer-driven
fingerprinting (RF) experiments. Uninduced Hep G2 cells and
the same line induced with reducing agents are compared. RNA
extractions, RT reactions and PCRs are done in duplicate.
U1, U2: uninduced; I1, I2: induced.

Detailed Description of the Invention

The random sequences described in a) of the above 
method can be generated easily using straight-forward 
computer algorithms. 




Simulation of PCR reactions as in b) of the above
method can be performed by, for example, searching both
strands of the target sequence for a sequence complementary
to the primer sequence, permitting varying degrees of
mismatch, for example 3 mismatches. A PCR product is scored
if a suitable match is found on both strands and the
matching sequences are within a predetermined distance from
each other, for example from 100 to 1,000 bp apart.
Searches can be performed using any of several
commercially available software packages, such as
FINDPATTERNS in the Wisconsin GCG package.

Non-redundant nucleotide sequence databases are used to 
provide target sequences for PCR simulations. Nucleotide
sequence databases are easily accessible to the skilled person,
two of the largest and best known being the Genbank and
EMBL databanks. From these databanks, a subset of sequences
is selected. Only sequences containing at least 500 bp,
preferably at least 1000 bp, of coding sequence are
selected. Furthermore, redundant sequences are eliminated.
That is to say, for any given gene, often more than one 
entry occurs in the databank and it is desirable to select 
only one of the entries if they all have very similar 
sequences. One method of achieving this is to compare 
databank entries which have a common word in their sequence 
descriptions with a sequence comparison program, for example 
FASTA, and eliminate the shorter sequence if the two 
sequences have sequence identity above a percentage 
threshold, preferably greater than 95%. It is also
desirable to eliminate intron sequences from genomic
sequences to produce a contiguous cDNA sequence.

The oligonucleotide primers may be of any length and
preferably comprise from 10 to 20 nucleotides; for
example they may consist of 12, 15 or 18 nucleotides, more
preferably 12 nucleotides. Primers may contain additional
groups such as labelling groups, for example biotin, or
radio-labelled substituents. Primers may be synthesised
using standard methods known to those skilled in the art,
for example using an automated oligonucleotide synthesiser.

In order to obtain more readable PCR-amplification
gels, the efficiency index as defined above and used as one
of the two criteria for selecting candidate primers should
be neither too low (preferably > 2) nor too high
(preferably < 10). The selectivity index as defined above
and used as the other criterion for selecting candidate 
primers is preferably higher than 1, more preferably higher 
than 1.8, even more preferably higher than 2. 

Some primers may produce PCR products containing both 
coding sequences and untranslated regions. However, the 
amount of coding sequence in some cases may be very small. 
Therefore, when determining efficiency and selectivity
scores it may be preferable to consider a primer as
having yielded a product within a coding sequence only if the
amount of coding sequence within the product exceeds a
predetermined percentage, for example 10, 30, 50 or 70%, or
a predetermined length (e.g. 50, 100 or 200 nt).

The set of primers selected according to the method of
the invention described above may be further narrowed down
to produce a smaller set of primers. This can be
accomplished by simply selecting primers with the highest
selectivity indices. For example, if 500 primers are
selected after steps a) to c), it would not be necessary to
synthesise all 500 primers for use in PCR techniques. A
skilled person may select only, for example, from 10 to 100
primers. Generally, a skilled person would select primers
with the highest selectivity index and efficiency index
except that they may discard any primers with sequences that
are too similar to other selected primers, for example if
they are greater than 80% identical overall (or greater than
60% identical in the last 8 nucleotides at the 3' end).
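
The narrowing-down just described can be sketched as a greedy pass over the ranked primers. The sketch assumes equal-length primers compared position by position without gaps, and a ranking by selectivity index first and efficiency index second; both choices, like the function names, are illustrative assumptions rather than part of the method as described.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length primers carry the same base."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def too_similar(a, b, overall=0.80, tail=0.60, tail_len=8):
    """True if the primers exceed 80% identity overall or 60% over the last 8 nt (3' end)."""
    return identity(a, b) > overall or identity(a[-tail_len:], b[-tail_len:]) > tail


def pick_panel(scored, size):
    """scored: list of (primer, selectivity_index, efficiency_index) tuples.

    Primers are visited from the best-ranked downwards; a primer is kept only
    if it is not too similar to any primer already in the panel.
    """
    ranked = sorted(scored, key=lambda t: (t[1], t[2]), reverse=True)
    panel = []
    for primer, _si, _ei in ranked:
        if len(panel) == size:
            break
        if all(not too_similar(primer, kept) for kept in panel):
            panel.append(primer)
    return panel
```

Two primers differing only in their last base are 11/12 identical overall, so only the better-ranked of the two would be retained.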

The kit described above will typically contain from
about 10 to 200 primers, or pairs of primers degenerate at
one position (e.g. the last nucleotide at the 3' end),
preferably from 20 to 100 primers (or pairs), for example
30, 60 or 96 primers (or pairs), selected by the method of
the invention.

The method of the invention is hereinafter described,
by way of an example, in more detail with reference to
specific databases and experimental conditions.

a) Simulation of mRNA PCR in nucleotide databases

PCR simulations were run on two nonredundant (nr)
databases, obtained from a combination of human or mouse
sequences deposited into the Genbank and EMBL nucleotide
sequence databanks (accessed through the GCG Wisconsin
package, version 8.1-UNIX, August 1995) (7), using one
arbitrary 12-nt primer sequence at a time, thus assuming
each primer to anneal in a degenerate fashion to the sense
and antisense strand.

The reduced human and mouse databases were obtained by 
selecting human or mouse sequences containing at least 1000 
bp of coding region (CDS). In order to decrease redundance,
variable regions of immunoglobulins and T-cell receptors
were eliminated, and all pairs of sequences sharing a word
in their product descriptions were compared by the FASTA
algorithm (8); the shorter one was eliminated when >95%
identical to the other. Intronic regions were eliminated 
from genomic sequences, generating new transcribed sequence 
files containing uninterrupted cDNA. 

Annealing of the primers was simulated by searching
both strands for the sequence of each primer in the nr
databases by means of the FINDPATTERNS program in the
Wisconsin GCG package (7), permitting a maximum of 3
mismatches. All pairings with one or more mismatched base(s)
among the last 4 (at the 3' end) were excluded as unsuitable
to prime a polymerase chain reaction (PCR). A simulated PCR
product was scored whenever a pairing occurred on the sense
strand and, 100-1000 bp downstream, on the antisense
strand.
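
The scoring rule above can be illustrated with a short, self-contained sketch. It is not the FINDPATTERNS program: it is a plain sliding-window search written for illustration, the function names are hypothetical, and the 100-1000 bp distance is interpreted here as the length of the product measured from the 5' end of one primer site to the 5' end of the opposite one.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def primer_sites(strand, primer, max_mismatches=3, clamp=4):
    """Start positions on `strand` (read 5'->3') where the primer sequence occurs
    with at most `max_mismatches` mismatches and none in the last `clamp` bases."""
    k = len(primer)
    sites = []
    for i in range(len(strand) - k + 1):
        window = strand[i:i + k]
        if window[-clamp:] != primer[-clamp:]:
            continue  # a mismatch within the 3' clamp cannot prime
        mismatches = sum(a != b for a, b in zip(window[:-clamp], primer[:-clamp]))
        if mismatches <= max_mismatches:
            sites.append(i)
    return sites


def simulate_pcr(sense, primer, min_len=100, max_len=1000, max_mismatches=3, clamp=4):
    """Return (start, end) sense-strand coordinates of simulated single-primer products."""
    n = len(sense)
    forward = primer_sites(sense, primer, max_mismatches, clamp)
    # A site at position j on the antisense strand ends at sense coordinate n - j.
    antisense = reverse_complement(sense)
    reverse_ends = [n - j for j in primer_sites(antisense, primer, max_mismatches, clamp)]
    return [(start, end)
            for start in forward
            for end in reverse_ends
            if min_len <= end - start <= max_len]
```

Running `simulate_pcr` over every transcript of a database and counting the returned pairs gives the per-primer product counts used in the remainder of this section.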

For each primer, simulated PCR products were tagged
with a CDS flag if they contained a coding sequence
portion, and with a UTR flag if they contained a portion of 3'
untranslated region. Each primer could be assigned an
"efficiency" score (total number of simulated PCR products
in the sequence database) and a "selectivity" score (ratio
of the probabilities of yielding a PCR product comprising
coding sequences or 3' untranslated regions). A crucial
aspect in assessing the validity of the approach proposed
here is to exclude the possibility that differences in
"efficiency" observed among random primers are due to
chance. Thus, the distribution of the number of products
per primer, obtained through the simulation experiments, was
compared to the one expected for random primers and random
sequences, considering that (i) all primers were constituted
by 8 G/C and 4 A/T, (ii) each sequence in the databank had a
certain proportion of G/C and (iii) perfect match was
required for 4 bases at the 3' end whereas up to 3
mismatches were allowed over the first 8 bases from the 5'
end of each primer. The computation of the expected products
number distribution is described in the following section.

b) Computation of the expected distribution of PCR
products per primer

Let the databank, $D$, be a set of $N$ sequences,

$$D = \{ S_s \mid s \in [1, N] \}.$$

Given that $S_s$ is a sequence composed of $a_s$ C/G
nucleotides and $b_s$ A/T nucleotides, the probability of a G
or C nucleotide in the primer matching an arbitrary
nucleotide in the sequence is

$$\frac{a_s}{2(a_s + b_s)}$$

and the corresponding probability for A or T is

$$\frac{b_s}{2(a_s + b_s)}.$$

In order to obtain hybridization a certain degree of
matching must be obtained; here we arbitrarily decided that
hybridization would occur for at least 9 matching bases out
of 12, with no mismatches within the last 4 bases at the 3'
end. Under these conditions, the probability of
hybridization for a given template sequence and a given
primer is a function of the fraction of C/G nucleotides in
the sequence, $F_s = a_s/(a_s + b_s)$, the number of C/G in the first
8 bases of the primer, $n_1$, and the number of C/G in the last
4 bases at the 3' end, $n_2$.

For any specific alignment of the primer on the
template $S_s$, we have:

$$p[\text{base}(j) \text{ of the primer matches} \mid F_s] = \begin{cases} F_s/2 & \text{base}(j) \in \{C, G\} \\ (1 - F_s)/2 & \text{base}(j) \in \{A, T\} \end{cases}$$

$$A_s = p[\text{at least 5 out of the first 8 bases match} \mid F_s, n_1] = \sum_{j=0}^{n_1} p[j \text{ matches out of } n_1\ (C \vee G)] \cdot p[\text{at least } 5 - j \text{ matches out of } 8 - n_1\ (A \vee T)]$$

$$= \sum_{j=0}^{n_1} \binom{n_1}{j} \left(\frac{F_s}{2}\right)^{j} \left(1 - \frac{F_s}{2}\right)^{n_1 - j} \sum_{k=\max(0,\,5-j)}^{8 - n_1} \binom{8 - n_1}{k} \left(\frac{1 - F_s}{2}\right)^{k} \left(1 - \frac{1 - F_s}{2}\right)^{8 - n_1 - k}$$

$$B_s = p[\text{all 4 bases at the 3' end match} \mid F_s, n_2] = \left(\frac{F_s}{2}\right)^{n_2} \left(\frac{1 - F_s}{2}\right)^{4 - n_2}$$

so that the probability of hybridization for any
specific alignment of the primer on the template is $P_s = A_s \cdot B_s$.
The average value of $F_s$ was 0.53 (± 0.082) and in
general $P_s$ was about $1\text{-}2 \cdot 10^{-4}$, its value increasing for
primers with increasing numbers of C/G nucleotides in the
last 4 positions.
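
As a numerical check of the order of magnitude quoted for the hybridization probability, the two factors can be evaluated directly. The sketch below assumes the reading of the formulas given in this section (per-base match probabilities of F/2 for C/G and (1 - F)/2 for A/T, at least 5 matches among the first 8 bases, and a perfect match over the last 4); the function names are hypothetical.

```python
from math import comb


def tail_at_least(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p); equals 1 when k <= 0 and 0 when k > n."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(max(k, 0), n + 1))


def hybridization_probability(f_s, n1, n2, head=8, clamp=4, min_head_matches=5):
    """Probability that a primer hybridizes at one given alignment.

    f_s: C/G fraction of the template; n1: C/G count in the first `head` bases
    of the primer; n2: C/G count in the last `clamp` bases (3' end).
    """
    p_gc = f_s / 2          # a C or G in the primer matches the template base
    p_at = (1 - f_s) / 2    # an A or T in the primer matches the template base
    a_s = 0.0
    for j in range(n1 + 1):
        p_j = comb(n1, j) * p_gc**j * (1 - p_gc)**(n1 - j)
        a_s += p_j * tail_at_least(head - n1, p_at, min_head_matches - j)
    b_s = p_gc**n2 * p_at**(clamp - n2)
    return a_s * b_s
```

For a primer with 8 C/G in total and a template C/G fraction of 0.53, calling `hybridization_probability(0.53, 8 - n2, n2)` for `n2` from 0 to 4 gives values of the order of 10^-4 that increase with `n2`.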

Assuming that PCR products of interest would have a
length, $L$, comprised between $L_0$ = 100 and $L_1$ = 1000 bp, then
the number of combinations of two acceptable positions on a
template sequence $S_s$ of length $M_s$ is a quantity $C_s$
determined by $M_s$, $L_0$ and $L_1$.



This gives rise to a binomial distribution of the
number of PCR products obtained from template sequence $S_s$ of
length $M_s$ and a primer with given values of $n_1$ and $n_2$. Such
distribution is defined by the binomial parameters $p = P_s^2$
and $n = C_s$; the corresponding probability density function
(p.d.f.) is:

$$P_s(x) = \binom{C_s}{x}\, p^{x}\, (1 - p)^{C_s - x}$$

Actually, to estimate the probability of obtaining
simulated PCR products (neglecting the technical aspects
connected to experimentally obtaining a PCR amplification
product), the unwanted possibility of a further
hybridization in between the two valid positions must be
excluded. For a product of length $L$, this possibility has an
approximate probability of

$$1 - (1 - P_s)^{2L} \approx 2L \cdot P_s$$

and therefore about $900 \cdot P_s$ for the average product length of
450 bp. For the usual magnitude of $P_s$ this factor amounts to
about $10^{-2}$ and can be neglected.

The distributions of PCR products from the N sequences
in the database

$$\{ P_s(x) \mid S_s \in D \}$$

are expected to be independent. Therefore, the corresponding
characteristic functions (ch.f.) can be computed

$$\{ \varphi_s(u) \mid S_s \in D \}$$

and the ch.f. for the whole databank will simply be:

$$\varphi_D(u) = \exp\left[ \sum_{s} \log \varphi_s(u) \right]$$



From this, the expected p.d.f. of the number of PCR
products from the whole databank, $P_D(x)$, is computed for
each primer. Averaging over the set of primers yields the
expected distribution of the number of PCR products per
primer ($P_1$). Notice that $P_D(x)$ is necessarily equal for
primers having the same values of $n_1$, $n_2$ and $P_s$. Thus, $P_1$
may be multimodal (up to 5 peaks for $n_2$ = 0 to 4).

The same procedure is used to compute the expected
p.d.f. of the number of PCR products from each sequence, $P_2$
(in this case the ch.f. is computed by summing the logs of
the single ch.f.'s over the set of primers for the same
sequence).

A third distribution of interest is that of the number
of "successful" primers per sequence (i.e. yielding at least
one PCR product from the sequence), $P_3$. This is computed in
the same way using the modified p.d.f. $\pi_s$, such that:

$$\pi_s(0) = P_s(0), \qquad \pi_s(1) = 1 - P_s(0)$$

Distribution $P_1$ is used to check whether the observed
distribution of PCR products per primer significantly
departs from the expectation: if a marked excess of
particularly "good" and "poor" primers is found, this
argues against a purely random distribution of nucleotides
in the sequences of the databank.

Distributions $P_2$ and $P_3$ yield information on the
exhaustivity of the approach, i.e. the capability of picking
out as many different sequences as possible. In particular,
the shape of the p.d.f. $P_3$ can be compared to the
corresponding distribution, obtained by the simulation
experiments, to check whether any bias is present towards a
subpopulation of sequences (i.e. whether some sequences are
significantly more subject to amplification than others).

This approach cannot be straightforwardly applied to
the distributions obtained using sets of particularly
efficient primers. These distributions are obviously shifted
to the right, with respect to the p.d.f. $P_3$, which is
computed based on a purely random nucleotide composition of
the databank. The size of the shift is conveniently
represented by the ratio of the mean values. If the shift
simply reflects an increased hybridization probability with
no bias towards sequence subpopulations, the shape of the
curve will be easily reproduced by computing the logarithm
of the characteristic function of the expected probability,
multiplying it by the ratio of the means and computing the
resulting probability distribution (this is performed by
applying the direct and inverse fast Fourier transforms). A
reasonable agreement with the observed distribution will
argue against biases in favour of specific sequence
subpopulations.
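
The characteristic-function computations of this section (summing the logarithms of the individual ch.f.'s, optionally multiplying the sum by a ratio of means, and transforming back) can be sketched as below. For self-containment the sketch swaps the fast Fourier transform for a plain discrete Fourier transform written out in pure Python, which is adequate only for short vectors; all names are hypothetical, and the vectors must be long enough to hold the whole support of the resulting distribution.

```python
import cmath
from math import comb


def binomial_pmf(n, p, size):
    """P(X = x) for x = 0..size-1 with X ~ Binomial(n, p)."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) if x <= n else 0.0
            for x in range(size)]


def dft(values, sign):
    """Plain discrete Fourier transform; sign=+1 gives the ch.f. at u = 2*pi*k/size."""
    n = len(values)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * k * x / n)
                for x, v in enumerate(values))
            for k in range(n)]


def combine_pmfs(pmfs, mean_ratio=1.0):
    """Distribution of a sum of independent counts, via summed log-ch.f.'s.

    mean_ratio multiplies the summed log-ch.f., which rescales the mean of the
    resulting distribution by that factor.
    """
    size = len(pmfs[0])
    log_phi = [0j] * size
    for pmf in pmfs:
        log_phi = [acc + cmath.log(v) for acc, v in zip(log_phi, dft(pmf, +1))]
    phi = [cmath.exp(mean_ratio * v) for v in log_phi]
    return [v.real / size for v in dft(phi, -1)]
```

As a sanity check, combining Binomial(3, 0.1) and Binomial(4, 0.1) in this way reproduces Binomial(7, 0.1), and a `mean_ratio` of 2 reproduces Binomial(14, 0.1).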

c) RNA fingerprinting

Reverse transcription is carried out using a (dT)16
primer on 1 mg total RNA extracted by the caesium chloride
method (9). Radioactive PCR reactions, in duplicate, are
performed from 2 µl of each RT reaction in 50 µl final
volume with arbitrary 12-mers (final conc. 4 mM), using
Perkin Elmer 1 x Amplitaq polymerase + MgCl2 [1.5 mM]. PCR
conditions are 3 minutes at 94°C, 2 minutes at 80°C at which
Taq polymerase is added (hot start), followed by 35 cycles
of 40 secs at 94°C, 1 min at 50°C, 1 min at 72°C, with a
final elongation step of 5 mins at 72°C. 0.2 µl [α-32P]dCTP
is added to each reaction. Amplified products are separated
on a 5% denaturing polyacrylamide gel and visualized by
autoradiography. Differentially displayed bands are cut from
the gel and electroeluted in dialysis bags as described (2).
Bands are reamplified using the same 12-mer primers and
cloned into a modified pBluescript II SK+ (Stratagene).
Clones corresponding to differentially displayed bands are
selected from the background of unrelated products.

Sequence analysis of cloned products - Data bank
searches (Genbank, GenEMBL, SwissProt and PIR) are run
through the BlastN and BlastX network servers (10).
Additional sequence analysis and contig assembly are done
using the GCG package.

d) Simulation of RNA fingerprinting PCR in human and 
murine nr nucleotide databases 

In a first series of simulations (MAN12.8) 10,000 12-
character strings were generated randomly, to represent
dodecanucleotide primers, with the initial requirement that
they contain 8 C or G and 4 A or T. Primers containing
either stop codons (TAA, TAG, TGA) in the sense strand or
homonucleotide stretches of 4 or more bases (AAAA, CCCC etc.)
were discarded (criteria a and b). Also discarded were primers with
palindromic 5' and 3' ends (>4 successive complementary
bases; criterion c) or containing >5/8 bases at the 3' end
identical to a previously accepted primer (criterion d). The
above criteria were aimed at biasing the primers towards the
CDS (a), at enhancing the efficiency of PCR experiments (b
and c) and at reducing the chance of targeting the same
sequence repeatedly (d). Finally, the last position (3' end)
was partially degenerate, consisting of a W (A/T) or of an S
(C/G).
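
A generator applying criteria a to d can be sketched in a few lines. The sketch is an illustration only: stop codons are rejected in any frame, the palindromic-end test counts successive bases from the 5' end that are complementary to the bases read inward from the 3' end, the threshold of more than 4 such bases follows the text literally, and the partial degeneracy of the last position is not modelled. Function names and the random seed are hypothetical.

```python
import random

STOP_CODONS = ("TAA", "TAG", "TGA")
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def random_primer(rng, length=12, gc=8):
    """Random string with exactly `gc` C/G and `length - gc` A/T bases."""
    bases = [rng.choice("CG") for _ in range(gc)]
    bases += [rng.choice("AT") for _ in range(length - gc)]
    rng.shuffle(bases)
    return "".join(bases)


def has_stop_codon(primer):
    return any(codon in primer for codon in STOP_CODONS)


def has_homopolymer(primer, run=4):
    return any(base * run in primer for base in "ACGT")


def end_pairing(primer):
    """Successive 5'-end bases complementary to the bases read inward from the 3' end."""
    count = 0
    for five, three in zip(primer, reversed(primer)):
        if COMPLEMENT[five] != three:
            break
        count += 1
    return count


def shares_3prime(primer, accepted, tail=8, max_identical=5):
    """True if more than 5 of the last 8 bases match any previously accepted primer."""
    return any(sum(a == b for a, b in zip(primer[-tail:], other[-tail:])) > max_identical
               for other in accepted)


def generate_primers(n, seed=0, max_end_pairing=4, max_tries=100000):
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n:
            break
        primer = random_primer(rng)
        if has_stop_codon(primer) or has_homopolymer(primer):
            continue  # criteria a and b
        if end_pairing(primer) > max_end_pairing:
            continue  # criterion c
        if shares_3prime(primer, accepted):
            continue  # criterion d
        accepted.append(primer)
    return accepted
```

Each accepted string can then be passed to a PCR simulation against the nr databases to obtain its efficiency and selectivity scores.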

A series of 1000 acceptable primers (according to
criteria a, b and c) was challenged against the human nr
database to measure their efficiency (total number of
simulated PCR products) and selective affinity for coding
portions of transcripts. Only primers yielding >100
simulated PCR products out of the 2,085 sequences of our human
nr database (6.04 Mb total DNA, 72.8% CDS) were included in
the "good primers" list (158 primers); 497 primers were
discarded because of their similarity to previously included
primers (criterion d). The remaining 345 primers yielded <
100 simulated PCR products.

Figure 1 illustrates a histogram of the number of simulated
PCR products obtained from each primer in this series (503
primers). Also illustrated is the probability density
function ($P_1$) expected based on the probability of matching
to the randomly scrambled database sequences (see Section b
for details on the computation of this curve). It can be
clearly seen that the observed distribution does not fit a
random distribution of bases in the sequences, and that the
extreme shoulders of the distribution curve are markedly
overcrowded. This indicates a large excess of particularly
poor and particularly efficient primers, and points to the
possibility of selecting efficient PCR primers based on the
present "simulated gene fishing" approach.

The simulation was repeated (MOUSE12.8) by testing the
same series of primers on the mouse nr nucleotide database,
containing 1041 sequences comprising 2.95 Mb total
nucleotide sequence (72.5% CDS), in order to check whether
species-specific features in sequence composition played a
major role in determining the efficiency scores of different
primers. In this case 193 "good" primers and 326
inefficient ones were obtained; 481 primers were not
considered due to criterion d. The distribution of the
numbers of simulated PCR products per primer was
qualitatively similar to the one obtained in the human
database (not shown).

Figure 2 is a scatter plot of the number of simulated PCR
products obtained in the human (x-axis) vs. the mouse
(y-axis) database, with the same primers (454). The two sets
of values correlate very well (correlation coefficient r = 0.947),
indicating that the unexpectedly high or low efficiencies of
some primers did not arise from aberrations in the
composition of the particular database used for the
simulation experiments, but rather from intrinsic
differences in efficiency among primers. In other words, it
appears that some "genetic strings" (the base composition of
some particular oligonucleotides) have particularly high or
low probabilities of occurring in mammalian coding
sequences.

One important aspect regards the exhaustivity of the
present approach. The primers selected because of their
high efficiency in yielding simulated PCR products may be
directed towards subpopulations of genetic sequences. If
this were the case, the number of sequences not picked out
(sequences from which no PCR products are obtained) should
be higher than expected on the basis of the average number
of primers giving rise to products in each gene.

Figure 3 illustrates this point; in particular, it shows the
distribution of the numbers of PCR products generated
from each transcript in the human database (A), and the distribution
of the numbers of different primers yielding at least one
PCR product from each transcript (B). Data are relative to
sequences containing at least 1000 bp of coding region. The
dashed line in B represents the expected distribution of
numbers of primers (among the 96 selected here) picking out
each sequence, given the average value of such distribution
(see Section b). The observed distribution reasonably agrees
with this expectation; in particular, the percentage of
sequences not picked up by any primer is reasonably well
predicted. These graphs suggest that the sets of "efficient"
primers proposed here are capable of targeting the entire
population of sequences deposited into databanks.

In order to further check for possible biases in the
procedure, the same kind of simulation was performed using
12-nucleotide primers composed of 6 A/T and 6 C/G (MAN12.6,
MOUSE12.6). Again, the primers displayed a wider range of
efficiencies than expected, and again the distribution of
the numbers of primers yielding products from each gene was
in reasonable agreement with the predictions of a
random-interaction model. As reported below, in actual experiments
stringency conditions had to be very relaxed in order to
obtain a reasonable number of PCR products using 12-
nucleotide primers. Better results were obtained by using
degenerate pairs of primers. As this approach might as well
increase the exhaustivity of mRNA fingerprinting, we
repeated the simulation using 12-nucleotide, 8-C/G primers,
containing a partially degenerate base (W or S) at their 3'
ends. This simulation was run on an updated database, and
the results were similar to those reported above. Again,
"poor" and "very good" primers are present in large excess
with respect to the expectation for a random-sequence
database. A set of 96 "best" primers was selected as those
displaying an efficiency index between 2 and 15
and a selectivity index above 1.4. Figure 4 shows that the
best 96 primer pairs yield a distribution of number of
primers spotting each sequence which is shifted to the right
with respect to the expectation (solid line), suggesting
that this subset of primers is particularly efficient; the
distribution is well fitted when the expectation is corrected
for the mean efficiency of this particular set of primers
(dashed line) and the number of sequences yielding no
simulated PCR products.

e) Experimental assessment of the primer panel

As a result of the elaboration described above, a panel
of 120 optimal degenerate primer sequences (Table 1) was
generated. All these primer pairs yielded an efficiency index (E.I.)
between 2 and 20 and a selectivity index (S.I.) > 1.4. Thirteen primers (some
belonging to this panel and some not) were synthesized and
tested at the bench, to assess the correspondence between
theoretical predictions and experimental results. As
templates, we utilized oligo-dT-primed cDNAs obtained from
various mouse and human cell lines. The results are shown in
Figure 5.

Exhaustivity - So far, all primers have displayed
amplification efficiencies consistent with those expected
on the basis of our simulation. Ten degenerate primers
tested so far on total cDNA from a HepG2 cell line have
produced patterns containing 80 to 167 bands (mean: 114.4
bands/gel) (Figure 6). NNN oligos predicted to amplify
inefficiently in mammalian cDNA were also tested, and two
gave very poor banding patterns (fewer than 30 bands).
Extrapolating these preliminary data, it may be inferred
that a differential screening study employing the best 120
primers should permit a survey of 10982 bands. An estimate
of the coverage provided by this figure depends on the
degree of redundancy and the complexity of the gene pool in
each given tissue or cell line. At this stage, an
experimental assessment of redundancy is of limited
significance, due to the small number of products analyzed
so far. To date, however, the ten cDNAs cloned with this new
set of primers came from ten distinct genes.
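
The extrapolation above is a simple product of the number of primers and the mean number of bands per gel. The following Python sketch reproduces the arithmetic; the function name is hypothetical. Note that 96 primers at 114.4 bands/gel give the quoted figure of 10982, whereas the same mean applied to 120 primers would give about 13728.

```python
def extrapolate_bands(n_primers: int, mean_bands_per_gel: float) -> int:
    """Expected number of surveyed bands, assuming one gel per primer
    and the same mean band count for every primer."""
    return round(n_primers * mean_bands_per_gel)


print(extrapolate_bands(96, 114.4))
print(extrapolate_bands(120, 114.4))
```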

Selectivity for coding regions - cDNAs cloned from
these experiments contained ORFs throughout their lengths in
9 out of 10 cases. Of the nine ORFs, five corresponded to
known coding regions (known genes or orthologs of known
genes). Of the remaining five clones, four were rated
"excellent" by the GRAIL program [11, 12], which predicted
coding regions within them; finally, although the tenth
clone was not recognized as 'coding' by GRAIL, it also
contained a 160 bp ORF at one end.

Accuracy - As for the correspondence between the RNA
levels predicted through our internally primed RNA
fingerprinting protocol and actual transcript levels, out of
28 genes cloned with this method using the old
(nondegenerate) and new (degenerate) sets of primers, 25
have been found to be differentially expressed as expected.
RNA studies have been performed using cloned "differentially
displayed" bands as probes or sources of primers (Northern
analysis, RNase protection assay, quantitative RT-PCR).

Sensitivity - Besides assessing the above issues, we
tried to determine whether our internal primers exhibit a
preference for highly abundant transcripts (sensitivity). In
at least two cases, cDNA clones generated by our method
produced no hybridization signal by Northern analysis of 30
µg total RNA. In two cases, cDNAs found by RT-PCR in a given
tissue have required plating of over 10^6 pfu to permit
isolation of the corresponding gene by hybridization-based
screening of the appropriate library.



[Table 1. Panel of 120 optimal degenerate primer sequences. Table contents illegible in the source.]

REFERENCES 

1. Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D.,
Smith, J.A. and Struhl, K. (1995) Current Protocols in
Molecular Biology.

2. Liang, P. and Pardee, A.B. (1992) Differential display
of eukaryotic messenger RNA by means of the polymerase chain
reaction. Science, 257, 967-971.

3. Bauer, D., Muller, H., Reich, J., Riedel, H.,
Ahrenkiel, V., Warthoe, P. and Strauss, M. (1993)
Identification of differentially expressed mRNA species by
an improved display technique (DDRT-PCR). Nucleic Acids
Research, 21, 4272-4280.

4. Liang, P. (1993) Distribution and cloning of eukaryotic
mRNAs by means of differential display: refinements and
optimization. Nucleic Acids Research, 21, 3269-3275.

5. Liang, P. (1994) Differential display using one-base
anchored oligo-dT primers. Nucleic Acids Research, 22,
5763-5764.

6. Welsh, J., Chada, K., Dalal, S.S., Cheng, R., Ralph, D.
and McClelland, M. (1992) Arbitrarily primed PCR
fingerprinting of RNA. Nucleic Acids Research, 20,
4965-4970.

7. Devereux, J., Haeberli, P. and Smithies, O. (1984) A
comprehensive set of sequence analysis programs for the VAX.
Nucleic Acids Research, 12, 387-395.

8. Pearson, W.R. (1994) Using the FASTA program to search
protein and DNA sequence databases. Methods in Molecular
Biology, 25, 365-389.




9. Sambrook, J., Fritsch, E.F. and Maniatis, T. (1989)
Molecular Cloning: A Laboratory Manual. Cold Spring Harbor.

10. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and
Lipman, D.J. (1990) Basic local alignment search tool. J.
Mol. Biol., 215, 403-410.

11. Roberts, L. (1991) Science, 254, 805.

12. Lopez, R., Larsen, F. and Prydz, H. (1994) Genomics, 24,
133-136.




CLAIMS 

1. A method for the differential screening of gene 
expression in biological samples by means of random priming 
RT-PCR, characterized in that the PCR is carried out using a 
plurality of oligonucleotide primers the sequence of which 
has been determined by a method comprising the following 
steps: 

a) generation of random primer sequences having a CG/AT
ratio of 2:1, no stop codon, no more than three
consecutive identical nucleotides and no palindromic 5'
and 3' ends;

b) screening of the primer sequences generated in a) by
simulating PCR reactions on non-redundant mammalian
nucleotide sequence databank entries containing at
least 1,000 bp of coding region and calculating for
each primer sequence its:

(i) efficiency index, said efficiency index being
defined as the ratio of the number of PCR products
comprising coding sequences obtained using said
primer sequence to the modal number of PCR
products comprising coding sequences obtained for
each of the whole set of tested primers generated
in a); and

(ii) selectivity index, said selectivity index being
defined as the ratio between the probabilities of
yielding a PCR product comprising coding sequences
or 3' untranslated regions; and

c) selecting some or all of the primer sequences screened
in b) according to their efficiency index and
selectivity index for use in PCR.

2. A method according to claim 1 wherein the
oligonucleotide primers consist of 12 nucleotides.

3. A method according to claim 2 wherein the
oligonucleotide primers consist of 8 C or G and 4 A or T.

4. A method according to any one of the previous claims
wherein each oligonucleotide primer differs from other
primers in at least 5 out of 8 bases at the 3' end.

5. A method according to any one of the previous claims
wherein the simulated PCR reaction is carried out on
non-redundant human or mouse data banks from which variable
regions of immunoglobulins and T-cell receptors as well as
intronic regions are eliminated.

6. A method according to any one of the previous claims
wherein the oligonucleotide primers have an efficiency index
between 2 and 10, and a selectivity index higher than 1.

7. A method according to any one of the previous claims,
wherein the oligonucleotide primers are partially degenerate
at the last position of the 3' end.

8. A kit for differential screening of gene expression in 
biological samples by means of random priming RT-PCR 
comprising: 

a) a plurality of oligonucleotide primers selected 
according to the method of claim 1; 

b) reagents for the reverse transcription and
amplification reactions;

c) optionally, protocols for the cloning of the products
of differential screening.

9. A kit according to claim 8 comprising the
oligonucleotide primers of Table 1.
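
The sequence constraints of step a) of claim 1, together with the composition recited in claims 2 and 3, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the program used by the applicants: a stop codon is taken to disqualify a primer in any of the three forward frames, an "end" is taken to be the terminal four bases, and an end is considered palindromic when it equals its own reverse complement. All function names and the end length are hypothetical.

```python
import random

STOP_CODONS = ("TAA", "TAG", "TGA")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def has_stop_codon(seq):
    # Assumption: a stop codon in any forward frame disqualifies the primer.
    return any(stop in seq for stop in STOP_CODONS)


def has_long_run(seq, max_run=3):
    """True if more than max_run identical nucleotides occur consecutively."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def has_palindromic_end(seq, k=4):
    # Assumption: an "end" is the terminal k bases, and it is palindromic
    # when it equals its own reverse complement (e.g. GATC).
    return any(end == reverse_complement(end) for end in (seq[:k], seq[-k:]))


def is_valid_primer(seq):
    """12-mer with 8 C/G and 4 A/T that passes the three sequence filters."""
    gc = sum(base in "CG" for base in seq)
    return (
        len(seq) == 12
        and gc == 8
        and not has_stop_codon(seq)
        and not has_long_run(seq)
        and not has_palindromic_end(seq)
    )


def random_primer(rng):
    """Random 12-mer with exactly 8 C/G and 4 A/T in shuffled order."""
    bases = [rng.choice("CG") for _ in range(8)]
    bases += [rng.choice("AT") for _ in range(4)]
    rng.shuffle(bases)
    return "".join(bases)


def generate_primers(n, seed=0):
    """Draw random candidates until n distinct valid primers are found."""
    rng = random.Random(seed)
    primers = set()
    while len(primers) < n:
        candidate = random_primer(rng)
        if is_valid_primer(candidate):
            primers.add(candidate)
    return sorted(primers)


print(generate_primers(5))
```

Candidates produced this way would still have to be screened by simulated PCR and ranked by efficiency and selectivity index, as in steps b) and c).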